Follow the procedure using the interface to load the file or try to use the function read.table.
Clinical data: clinical_data.csv, name the variable clinical
Code
# or you can try to adapt this codeclinical <-read.table("datasets/clinical_data.csv",sep=",",head=T)head(clinical)
patient_id age gender smoking_status height weight
1 1 64.96714 Male Smoker 180.9910 98.83183
2 2 58.61736 Male Non-Smoker 162.6784 53.58433
3 3 66.47689 Male Non-Smoker 176.1161 76.92134
4 4 75.23030 Male Non-Smoker 165.7124 69.74749
5 5 57.65847 Male Smoker 161.1106 70.78215
6 6 57.65863 Male Non-Smoker 179.1095 68.22497
blood_pressure_systolic blood_pressure_diastolic cholesterol glucose
1 122.50182 78.18016 203.4560 67.06891
2 109.61795 75.56966 194.7213 103.10724
3 113.57004 91.31688 160.4247 119.15123
4 110.89187 107.75532 145.2192 101.06535
5 98.80158 81.78133 138.7112 89.83955
6 136.26337 83.63899 214.1575 194.89959
heart_rate hospital_visits_last_year diagnosis treatment_group
1 60.04614 2 Heart Disease Control
2 88.67212 6 Heart Disease Drug B
3 83.17836 4 No Disease Control
4 79.68278 2 Diabetes Drug A
5 57.38393 5 Hypertension Drug A
6 96.59123 0 Diabetes Control
treatment_response
1 No Change
2 No Change
3 No Change
4 No Change
5 Worsened
6 No Change
What is a plot ?
A graphical plot gives us a visual summary of total (or some parts) of a dataset. It is composed of several elements :
Data : It is what we want to represent, it is the heart of the plot.
Axis : They allow you to set visualization limits and understand data scale.
Labels : This allows you to give names to the axes, so you know what they represent. It is also possible to label data directly on the plot.
Shape : Depending on the data we have and we want to have as representation, we’ll obtain a geometric shape that can range from a simple point, through a line, to more complex forms.
How ggplot2 works ?
The “gg” in ggplot2 stands for Grammar of Graphics. So we’re going to use some special verbs to help us build our final graph.
ggplot2 basically works like a cake with several layers.
Each layer has a role to play in the final rendering. As with a real cake, it’s not necessary to use every available layer to make a good plot. The more information, colors or labels a plot contains, the more likely it is to become illegible. You need to find the right balance. (See in slides into the ggplot2 section)
Install ggplot2 and load the library
Code
install.packages("ggplot2")library(ggplot2)
Display your first plot step by step
Before unfolding the code and answers boxes, try to find the solution from what we have seen in the lecture or by searching on the web, it should become an automatism, no one knows everything !
This is what our plot will look like. Now, let’s break it down layer by layer and build it.
Use the function ggplot() alone.
Code
ggplot()
A default grey background is displayed.
Display the axis: Add the data and choose the x and y axis from the column names that you can display using colnames. For instance, we want to evaluate the distribution of the Glucose concentration (y-axis) in function of the diagnosis (x-axis).
# Same code as above but simplified, without the name of argumentsggplot(clinical, aes(diagnosis, glucose))
On this line, we tell ggplot to map diagnosis on the x-axis and glucose on the y-axis. Note that the axis are automatically labelled with the column names.
The default theme displays a grey background and a white grid. We will see that is easy to change this features.
Display the geometric shape: boxplot
Add a layer to the plot. Which is the function to use ?
You can play globally on the color of the borders (parameter color), the filling (parameter fill), the transparency (parameter alpha)) and more. The accessible aesthetics can vary according to the geometric shape, form. instance for lines you will have the width and the type.
Change the color of the boxplots in function of the type of disease
During the previous step the same colors have been applied to all boxplots, the goal here is to use the information contained in a column of our dataset (diagnosis). This is part of the aesthetics. Based on the current code, how would you add this information (2 possibilities) ?
Option 1
Code
# Directly in the aesthetics of the ggplot function that will be accessible for all layersggplot(data = clinical, mapping =aes(x=diagnosis, y=glucose, color=diagnosis)) +# Add the new aesthetics colorgeom_boxplot() +# remove the global colorslabs(x="Type of Disease", y="Glucose (mg/dL)") +theme_bw()
Option 2
Code
# Within the geom_boxplot() function which will define the aesthetics # with aes() only for this layerggplot(data = clinical, mapping =aes(x=diagnosis, y=glucose)) +geom_boxplot(aes(color=diagnosis)) +labs(x="Type of Disease", y="Glucose (mg/dL)") +theme_bw()
By default the colors are not very contrasted … you can customize them by :
Defining your own palette optionally by attributing a color to each category ;
Find a predefined palette to help you to define the colors. To do this task, you will need to add a new layer by using a function that help in scaling the colors manually.
i) You can search for how colors are named in R if you want to try your favorite colors. Many palettes are available through dedicated packages. (My favorite function to call for colors is colours() systematically installed and loaded with R, you can try some of my preferred ones: colours()[124], colours()[613], colours()[53], colours()[144]).
Key words to write on your favorite browser
R ggplot2 colors
Code
# the fonction colours() returns the name of the colorcolours()[124]
[1] "deepskyblue3"
Code
# use the function c() to concat several elements, we call it a vectorc(colours()[613], colours()[53], colours()[144])
[1] "springgreen3" "chocolate1" "gold2"
R ggplot2 boxplot color
To define your own palette, you need to concatenate as many colors as the number of categories, here 4 colors. You can assign the colors to the categories by building a named vector.
Code
# We use the function c(), to name the elements of a vector you use the name you want# and you assign the values with the sign =palette.disease <-c("Diabetes"=colours()[144],"Heart Disease"=colours()[34],"Hypertension"=colours()[613],"No Disease"=colours()[24])# Display the contentpalette.disease
Diabetes Heart Disease Hypertension No Disease
"gold2" "brown2" "springgreen3" "black"
Now you have your colors, add the layer to the ggplot object and use the variable palette.disease you just created.
ii) R natively provides a few continuous color palettes which means that you do not need to install and load any library. Also when you install ggplot2, it installs some dependencies (packages that ggplot2 needs to work properly), among them it installs the packages RColorBrewer, viridis, paletteer, khroma etc. and have implemented specific functions to use the palettes, it means that those packages are installed but not loaded. You can see which functions are available when typing scale_color_ on RStudio. If you want to use the functions from these packages you need to load the package but it is not mandatory. Finally, a lot more palettes are available through packages.
There is two ways, the first one is very simple and consists in inverting x and y aesthetics, the second one involves the addition of a new layer to flip the axis (preferred option and it provides more flexibility). Lets keep our own color palette palette.disease
This task consist of adding a new layer containing the points colored by known oncogenes. Which function should we use ? Note that there are two answers but one is more suitable for boxplot as it is adding some random noise to the positions in order to avoid overlaps. Set the size of the points to 0.3
Your goal is to produce the following density plot showing the density distributions of the glucose concentration in function of the diseases (use palette.disease). Here we introduce the customization of the plot dimensions, we want to force the plot to have be x-axis 2 times longer than the y-axis which we call the aspect ratio. Thanks to it, you can force to have squared plots for instance.
Identify what contain the axis.
Answer
The x-axis contains the value of the column glucose and the y-axis indicates the density of the distribution, the y-axis do not need to be specified.
Make a list of the aesthetics you need to modify and layers that need to be added.
Answer
Layers : the geometric shape, label of the x-axis, theme background, theme aspect ratio, custom color palette
Aesthetics : x-axis , color of the lines
Write your code ! (Do not forget your best friend: the Web)
Your goal is to produce the following bar plot showing the number of patient that are smoking (column smoking_status - play with the transparency) in each outcome. In this case the aspect ratio is inverse compared to the previous one.
Identify what contain the axis.
Answer
The x-axis indicates the number of sample falling in the different outcomes, the x-axis do not need to be specified. The y-axis indicates the different outcomes. As a reminder, one would prefer to flip the coordinates but it is not mandatory.
Make a list of the aesthetics you need to modify and layers that need to be added.
Answer
Layers: the geometric shape, label of the y-axis, theme background, theme aspect ratio, custom color palette
Aesthetics: y-axis , color of the lines, fill of the bars, transparency according to gender
Write your code ! (Do not forget your best friend: the Web)
Warning: Using alpha for a discrete variable is not advised.
Changing layout of the data
You are working on your dataset. You are happy because you have brilliantly succeed on doing plots from your data.
But there’s one question on your mind (maybe) : Does the smoking status seem to be associated with other factors ?
The answer to this question comes quickly. “We could plot the proportion of smokers in each category !”
Option 1 (not recommended)
You can plot independently the four barplots. It’s a possibility, after all we only have 4 factors to compare. But this method becomes more time-consuming and laborious when several dozen factors are involved.
Our goal is to obtain a table in which we have all combinations of smoking habits and other features for each patient, in other words we want a line per feature and its measure in each patient and its smoking habit. The column containing the features (gender, diagnosis, treatment_group, treatment_response) will be use to split the plot into 4 panels, one per feature. This is called faceting.
i) We will first extract a data.frame containing the qualitative values and store it in a variable named qual.features. To do that use the functions provided by dplyr.
Tips
You need to use a combination of functions that allows to select columns according to a type of variable. You can find the answer in the dplyr tutorial or on the web helped or not by an artificial intelligence.
gender smoking_status diagnosis treatment_group treatment_response
1 Male Smoker Heart Disease Control No Change
2 Male Non-Smoker Heart Disease Drug B No Change
3 Male Non-Smoker No Disease Control No Change
4 Male Non-Smoker Diabetes Drug A No Change
5 Male Smoker Hypertension Drug A Worsened
6 Male Non-Smoker Diabetes Control No Change
ii) We will “lengthens” the dataset qual.features so that we obtain a longerdata.frame (named qual.features.long) where you will have the measures corresponding the smoking status and all other measures. The smoking_status column should be kept in the final table, the other features should be stored in a new column named features and the values stored in a column named values.
smoking_status features values
<char> <char> <char>
1: Smoker gender Male
2: Smoker diagnosis Heart Disease
3: Smoker treatment_group Control
4: Smoker treatment_response No Change
5: Non-Smoker gender Male
---
39996: Non-Smoker treatment_response No Change
39997: Non-Smoker gender Female
39998: Non-Smoker diagnosis Heart Disease
39999: Non-Smoker treatment_group Control
40000: Non-Smoker treatment_response No Change
=> Modify dataset layout with pivot_longer() and pivot_wider()
We can change the layout of a dataset. It means, for example, transform columns into rows or the invert. This is a common task in data analysis, particularly when you need to organize data in a structure that is best suited for your analysis.
In R, this often involves transforming data between long and wide format. In a wide format, each row represent a single observation, and columns represent variables (like in qual.features). In a long format, each row is a single measurement, with a separate column for the variables and the values. This structure is particularly useful for certain graphical representations.
To switch from one shape to another, there are two functions called pivot_longer() and pivot_wider(). pivot_longer() function enables to convert multiple columns into key-value pairs, where column names become variable names, and their corresponding values are stacked. pivot_wider() function allows to spread key-value pairs across multiple columns. We say that pivot_longer() reshapes wide format to long format and pivot_wider() do the invert.
Try to play the pivot_longer() command as it is (don’t forget to use our data) … you obtain an error:
Error in pivot_longer():
! Arguments in ... must be used.
✖ Problematic arguments:
• ..1 = “features”
• ..2 = “values”
ℹ Did you misspell an argument name?
Run rlang::last_trace() to see where the error occurred.
Explore the pivot_longer() functions and identify the names of the parameters that are set in the figure example. Describe what is happening.
Answer
Search on the web or look at the help of each function.
?pivot_longer
In R, when the names of the parameters are not indicated, the order of the arguments is very important. It will correspond to the order of the arguments defined in the function.
The first parameter data is automatically filled by the stream %>% and take the data frame qual.features.
The second parameter cols do not have default argument and will take the first argument specified in the command: !smoking_status.
This argument indicates the column that should NOT be spread (the logical NO! makes the negation), indeed we want to keep this column in the resulting table as it will be use as a layer in our plot. If the ! was not present than the specified columns are spread.
The third parameter ... is specifically designed for additional column selectors, this means you can use multiple column selection methods or helpers (like starts_with(), contains(), etc.) to specify the columns to be pivoted, but most of the time you can achieve your goal with just cols, as the need for ... is rare in practice for this function. The argument "features" and "values" in ... are not being correctly understood by the function. No such functions exist. It is what the error says:
! Arguments in ... must be used.
✖ Problematic arguments:
• ..1 = “features”
• ..2 = “values”
From here you need to find the appropriate arguments to set the name of the column that will contain the column names to features and set the name of the column that will contain the values corresponding to the pairs smoking_status-features to values.
Here it is the answer:
Code
qual.features.long <- qual.features %>%pivot_longer(# Exclude smoking_status, pivot other columnscols =!smoking_status, # A character vector specifying the new column or columns to create from # the information stored in the column names of data specified by cols.names_to ="features", # A string specifying the name of the column to create from the data stored in cell values.values_to ="values" )
At least three ways of writing the code are possible.
Way 1
Code
# specifying the columns to pivotqual.features.long <- qual.features %>%pivot_longer(cols =c(gender,diagnosis,treatment_group,treatment_response),names_to ="features", values_to ="values")
Way 2
Code
# specifying the columns to pivotqual.features.long <- qual.features %>%pivot_longer(cols =!smoking_status,names_to ="features", values_to ="values")
Way 3 (tricky)
Code
# relocate smoking_status at the first column and spread from column gender to treatment_responsequal.features.long <- qual.features %>%relocate(smoking_status,.before = gender) %>%pivot_longer(cols=gender:treatment_response,names_to ="features", values_to ="values")
You can note that the number of rows after gathering is equals to the number of rows in the original data (10000) times the number of gathered columns (4) reflecting all the combinations of smoking_status and all 4 features for each patient.
iii) We will plot the proportion of smokers (opacity) in each category value, using faceting to wrap the other features.
Warning: Using alpha for a discrete variable is not advised.
Correlation plot
Using dplyr functions, select columns containing numeric values in clinical. Caution, the patient_id column contains numeric and we do not want it, we need to combine 2 conditions.
You have a very nice introduction to this package: here. All the information are in this page but don’t hesitate to search for more.
You want to obtain the following plot:
Here is the solution:
Code
# install the package that is in CRAN repositoryinstall.packages("corrplot")# load the librarylibrary(corrplot)# compute the correlation, by default it compute the correlation between columnsM <-cor(quant.features) # the plot !corrplot(M, method ='square',addCoef.col ='black',order ='AOE',type ='upper',diag = F,cl.pos ='b')
Exercise
You first need to load the data.
Follow the procedure using the interface to load the file or try to use the function read.table.
Clinical data: TCGA_LUAD_subset.tsv, name the variable luad_clinical
Expression data : TCGA_LUAD_expression.tsv, name the variable luad_expr
Differential Expression Gene in KRAS tumors between high and low MAPK signature : TCGA_KRAS_DEG_MAPK.tsv, name the variable KRAS_DEG
# or you can use R commandluad_clinical <-read.table("datasets/TCGA_LUAD_subset.tsv",sep="\t",head=T)luad_expr <-read.table("datasets/TCGA_LUAD_expression.tsv",sep="\t",head=T,row=1)KRAS_DEG <-read.table("datasets/TCGA_KRAS_DEG_MAPK.tsv",sep="\t",head=T)
Questions
All answers are plots with sometimes some preliminary data formating
The answers we provide are not the unique solutions rather just examples, you can find multiple ways to show tendencies in your data.
Here are some colors you could use …
# colours for oncogenespalette.known.onco <-c(ALK="darkgrey",BRAF_act=colours()[613],BRAF.non=colours()[11],EGFR=colours()[128],ERBB2=colours()[76],MAP2K1=colours()[509],MET=colours()[121],NF1=colours()[468],None=colours()[92],RIT1=colours()[642],ROS1=colours()[34],KRAS="black")
Are the oncogenes (column known.oncogenes) evenly represented in the dataset ?
Are there differences in MAPK activity (column signature_MAPK) in tumors depending on the mutated oncogene they are harboring (column known.oncogenes) ?
Researchers are particularly interested in some regulators of the MAPK signaling, among them we focus on DUSP4, DUSP6, SPRY2. How the expression of these genes globally correlates with the MAPK signature ?
3.1. Select the rows corresponding the phosphatase genes in the expression matrix, transpose the table, change its type to data.frame and add a column named sampleID containing the row names.
In this section, we want to visualise the gene expression across all samples using a heatmap and highlight potential expression patterns using sample and gene clustering.
What is a heatmap ?
In a heatmap, each row usually represents a gene. Each column represents a sample. The rectangles containing the values are also called “cells”.
The colour and intensity of the cells is used to represent values of gene expression. So, basically, instead of numbers, we use colours. The colour will be more intense the lower (or higher) the value is.
To see patterns, we cluster (meaningful reordering) the rows and columns. This just means we group the samples and/or genes together based on the similarity of their gene expression pattern. The dendograms on the sides just indicate the results of clustering both genes and samples.
A clustering algorithm will group genes with a similar expression across samples together: groups of genes that are highly expressed in some samples, and lowly expressed in other samples will be clustered together. In a similar way, samples with a similar expression pattern (in general having certain sets of genes highly expressed and certain sets of genes lowly expressed) will be clustered together.
In summary, adding clustering to heatmaps can be useful for identifying genes that are commonly regulated, or biological signatures associated with a particular condition (for example, high MAPK activity).
Step by step
Before starting …
i) We need to find a package that allow us to draw (pretty) annotated heatmaps. Lets search on the web what we could try.
Key words to write on your favorite browser
R How to draw annotated heatmap
The package we will be using
pheatmap
ii) Install the package.
Code
install.packages("pheatmap")library(pheatmap)
Step 1 : Prepare the data
To be able to annotate the heatmap we need to build the annotations for the samples (luad_clinical) and the genes of interest (DUSP4, DUSP6 and SPRY2) and associate colours to the categories for each feature.
i) Annotation of the samples
Using the variable clinical, extract the the columns known.oncogenes, pathologic_stage, gender, signature_group in a new variable anno.columns.
known.oncogenes pathologic_stage gender
1 KRAS Stage IB MALE
2 KRAS Stage IIIA FEMALE
3 RIT1 Stage IIIA MALE
4 BRAF_act Stage IA MALE
5 KRAS Stage IB FEMALE
6 KRAS Stage IIIB MALE
To allow the tool to map the annotations to the samples in the expression dataset, we need to name the row with the sample identifiers that are stored in the column sampleID of clinical.
known.oncogenes pathologic_stage gender
TCGA.05.4249.01 KRAS Stage IB MALE
TCGA.05.4250.01 KRAS Stage IIIA FEMALE
TCGA.05.4384.01 RIT1 Stage IIIA MALE
TCGA.05.4389.01 BRAF_act Stage IA MALE
TCGA.05.4390.01 KRAS Stage IB FEMALE
TCGA.05.4395.01 KRAS Stage IIIB MALE
To define the colours for each feature, you need to build a list using the constructor function list() in which each element will be a named vector of colors as you have already made to create palette.known.onco. For that you have to know all possible categories for each feature. table() is an easy and very useful function to display all categories with the number of sample per category.
table(anno.columns$gender)
FEMALE MALE
122 124
table(anno.columns$pathologic_stage)
Stage I Stage IA Stage IB Stage IIA Stage IIB Stage IIIA Stage IIIB
3 58 68 21 36 40 6
Stage IV
14
table(anno.columns$pathologic_stage)
Stage I Stage IA Stage IB Stage IIA Stage IIB Stage IIIA Stage IIIB
3 58 68 21 36 40 6
Stage IV
14
etc
You can see that for the pathological stages, there are subcategories A and B that we want to regroup in order to simplify the visualisation. To group the sub-categories, we will remove the A and B from the end of the words using the function sub(). Try to write the code, many answers are possible.
Here are 2 possibilities:
anno.columns$pathologic_stage <-sub("A","",anno.columns$pathologic_stage)anno.columns$pathologic_stage <-sub("B","",anno.columns$pathologic_stage)# The cleanest way of coding this task isanno.columns$pathologic_stage <-sub("[AB]$","",anno.columns$pathologic_stage)# [AB] indicates that at the last position (sign $) you encounter either A or B
table(anno.columns$pathologic_stage)
Stage I Stage II Stage III Stage IV
129 57 46 14
ii) Annotations of the genes
We want to identify the genes belonging to the MAPK activity signature. To this end, we create a data.frame using the constructor function data.frame() with the row names corresponding to the gene names and containing one column in.signature containing yes or no.
Create a vector containing “yes” or “no” depending on the gene presence in the signature. Many possibilities exist but the best way is to use the function ifelse(). This function takes 3 arguments, an expression comparing the gene names (row names of the table luad_expr) with the genes of interest (gene.of.interest), what the function must return if the comparison is TRUE (the gene is in the signature, “yes”), what the function must return if the comparison is FALSE (the gene is not in the signature, “no”).
To know if the elements of a vector are present in an other vector we use the operator %in%.
Try to write the code that allow to obtain the yes/no vector using ifelse() and genes.in.signature and keep the results in a variable called in.signature.
# Create annotation for the rows containing the genes: # we want to indcate the genes involved in the signatureanno.rows <-data.frame(row.names=row.names(luad_expr),in.signature)head(anno.rows)
in.signature
HLF no
SIDT1 no
KCNK5 no
NT5E no
ANKFN1 no
OSR2 no
Create a list colors.anno, specifying the colors of each category of each feature
# create a list containing the colors for each featurecolors.anno <-list(known.oncogenes=palette.known.onco,pathologic_stage=c("Stage I"=colours()[404], "Stage II"=colours()[33], "Stage III"=colours()[35], "Stage IV"=colours()[24]),gender=c(FEMALE=colours()[613], MALE=colours()[290]),in.signature=c(yes="black",no="white"))
Step 2 : Create a clustered heatmap
Display the heatmap with default parameters (do not change any parameters) using the function pheatmap() and the expression object luad_expr.
Code
pheatmap(luad_expr)
By default, pheatmap applies hierarchical clustering to both rows and columns. The parameter names are quite explicit.
By looking at the help of the function (?pheatmap) or online, identify the parameters and their default values that:
Enable row or column clustering (2 parameters)
Answer
cluster_cols = TRUE
cluster_rows = TRUE
specify the statistics to compute the distances between rows or columns (2 parameters)
specify how to aggregate the rows and columns based on their distance (1 parameters)
Answer
clustering_method = "complete"
Step 2 : Customize your heatmap
By analysing the default heatmap, we can notice that column labels are not readable and are not adding relevant information, we want to remove them, the row names are readable but we want to decrease the size of the font. We will test different method for clustering the rows and columns. We would like to change the colors of the cells to make them more divergent using for instance the color palette from viridis and we want to intensify the differences by lowering the range value amplitude (between -4 and 4), meaning that values outside of this range will display the extreme colors. We then want to annotate the samples and genes with nice colors, separate the clusters and finally add a tittle.
i) Row and column names
By looking at the help of the function (?pheatmap) or online, identify the parameters (and adapt their values) that:
Enable the display of the column names and disable it.
Answer
show_colnames = T # default
Enable to decrease the font size
Answer
fontsize = 10 # default
Code
pheatmap(luad_expr,cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendogramsshow_colnames = F, # displaying column namesshow_rownames = T, # displaying row namesfontsize =8# change font size )
ii) Change the colors
By looking at the help of the function (?pheatmap) or online, identify the parameters (and adapt their values) that:
Enable to set the color palette. In the example we used the inferno palette from the package viridis and asked for 100 colors.
Answer
color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(100) # Default
Enable to change the amplitude of the colors between -4 and 4
Answer
breaks = NA # default
Code
pheatmap(luad_expr,cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendogramsshow_colnames = F, # displaying column namesshow_rownames = T, # displaying row namescolor = viridis::inferno(100), # choose a colour scale for your databreaks =seq(-4,4,0.08), # seq produce a vector from -4 to 4 with a step of 0.08 (100 values)fontsize =8# change font size )
iii) Choose the clustering methods
Using the parameters you identified earlier for clustering methods, test several combinations to observe the impact on the results. You can find bellow the definition of the different methods. In the example, we chose to cluster following the correlationdistance and the ward.Dagglomeration methods.
Distance methods
Euclidean Distance: This is like drawing a straight line between two points. It’s the most intuitive way to measure distance, like the shortest path “as the crow flies” between two locations on a map.
Manhattan Distance: Think of how you would walk in a city laid out like a grid, where you can only move horizontally or vertically. The Manhattan distance measures the total number of blocks you would walk.
Correlation Distance: Instead of focusing on actual distance, this measures how similar the patterns or trends are between points. It’s useful when you care about how things change together rather than how far apart they are.
Maximum Distance (Chebyshev): This only looks at the largest difference in any single direction (horizontal or vertical). Imagine trying to figure out how far two buildings are by only checking the longest side of the blocks between them.
Canberra Distance: This one gives more weight to differences when values are small. It’s useful when small differences matter a lot in your data.
Binary Distance: This is used for yes/no or presence/absence type data. It simply compares if two points have the same “yes/no” values.
Minkowski Distance: This is a general form of both Euclidean and Manhattan distances. By changing a parameter, it can behave like either Euclidean (straight line) or Manhattan (grid-based).
Agglomeration methods
Ward’s Method (ward.D, ward.D2): This tries to form groups that minimize the difference within each cluster. It prefers clusters where data points are very similar to each other. It’s great for balanced, compact groups.
Single Linkage (single): This method focuses on the smallest distance between points in two clusters. It can create “chains” of points that are loosely connected, leading to long, stretched-out clusters.
Complete Linkage (complete): This method uses the largest distance between points in two clusters. It prefers tight, compact clusters, and avoids the stretched-out effect of single linkage.
Average Linkage (average/UPGMA): Here, we average the distances between all pairs of points in two clusters. It balances between tight and loose clusters, forming more moderate-sized groups.
McQuitty’s Method (mcquitty/WPGMA): Similar to average linkage, but it calculates the average in a slightly different way. It’s another method to balance between close and distant points.
Median Method (median/WPGMC): This method looks at the median distances between points, creating balanced clusters that can handle uneven data well.
Centroid Method (centroid/UPGMC): This method calculates the “center” of clusters (think of the average location of all points in the cluster) and uses that to decide how to group them.
Code
pheatmap(luad_expr,cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendogramsshow_colnames = F, # displaying column namesshow_rownames = T, # displaying row namesclustering_distance_cols ='correlation',clustering_distance_rows ='correlation',clustering_method ='ward.D', # how to aggregate rows and columns depending on their distancecolor = viridis::inferno(100), # choose a colour scale for your databreaks =seq(-4,4,0.08), # seq produce a vector from -4 to 4 with a step of 0.08 (100 values)fontsize =8# change font size )
iv)Separate clusters
By looking at the help of the function (?pheatmap) or online, identify the parameters (and adapt their values) that:
Enable to show distinct clusters, in the example, 6 sample clusters and 4 gene clusters (2 parameters)
Answer
cutree_rows = NA # Default
cutree_cols = NA # Default
Code
pheatmap(luad_expr,cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendogramsshow_colnames = F, # displaying column namesshow_rownames = T, # displaying row namesclustering_distance_cols ='correlation',clustering_distance_rows ='correlation',annotation_row = anno.rows, # row (gene) annotationsannotation_col = anno.columns, # column (sample) annotationsannotation_colors = colors.anno, # colours for your annotationsclustering_method ='ward.D', # how to aggregate rows and columns depending on their distancecolor = viridis::inferno(100), # choose a colour scale for your databreaks =seq(-4,4,0.08),cutree_cols =6, # display column clusterscutree_rows =4, # display row clustersfontsize =8# change font size )